
$\DeclareMathOperator{\tr}{tr}$

Laplace approximation & free-energy

§ Free-energy

The Fristonian free-energy is

$ F = U - S $

which comprises an energy term

$ U = \langle \ln p(y, \vartheta \mid m) \rangle_q = \langle L \rangle_q $

and a (negative) Shannon entropy term

$ S = \langle \ln q(\vartheta) \rangle_q $

§ Mean-field assumption

With the mean-field approximation over the variational density, $q$,

$ q(\vartheta) = \prod_i q_i(\vartheta_i) = \prod_i q_i. $

Let us for now assume the following case

$ q(\vartheta) = q_i(\vartheta_i)\, q_j(\vartheta_j) = q_i q_j, \quad j \neq i $

and note that the following satisfies $\delta F / \delta q = 0$

$ \ln q_i = -\ln Z_i + V(\vartheta_i) = -\ln Z_i + \langle L(\vartheta) \rangle_{q_j}. $

where $Z_i$ is a normalisation constant and $V_i=V(\vartheta_i)$ is known as the variational energy: the log joint probability of data and parameters, $L$, expected under its Markov blanket, $q_j$.
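
To see why this is the stationary point (a standard variational-calculus step, spelled out here for completeness), add a Lagrange multiplier $\lambda_i$ for the normalisation of $q_i$ and take the functional derivative:

$ \frac{\delta F}{\delta q_i} = \langle L(\vartheta)\rangle_{q_j} - \ln q_i - 1 + \lambda_i = 0 \quad\Longrightarrow\quad q_i \propto \exp\!\left(\langle L(\vartheta)\rangle_{q_j}\right) $

with $Z_i$ absorbing the constants.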

§ Laplace approximation

Under the Laplace assumption, let us suppose that

$ q_i = N(\vartheta_i \mid \mu_i, \Sigma_i) $

are Gaussian, where

$ \Sigma_i = \left\langle (\vartheta_i - \mu_i)(\vartheta_i - \mu_i)^T \right\rangle $

is the covariance matrix.

Following the factorisation, we write

$ \begin{aligned} F &= U - H \\ U &= \langle L(\vartheta) \rangle_{q_i q_j} \\ H &= \langle \ln q_i \rangle_{q_i} + \langle \ln q_j \rangle_{q_j} \end{aligned} $

The (neg-)entropies are (see note A.1)

$ \begin{aligned} H &= -\frac{D_i}{2}(1+\ln 2\pi) - \frac12 \ln|\Sigma_i| - \frac{D_j}{2}(1+\ln 2\pi) - \frac12 \ln|\Sigma_j| \\ &= -\frac{D_i}{2}\ln 2\pi e - \frac12 \ln|\Sigma_i| - \frac{D_j}{2}\ln 2\pi e - \frac12 \ln|\Sigma_j| \end{aligned} $

Further, with second-order truncation (dropping bilinear terms)

$ \begin{aligned} L(\vartheta) &\approx L(\mu) + (\vartheta_i-\mu_i)^T L^{(\vartheta_i)} + (\vartheta_j-\mu_j)^T L^{(\vartheta_j)} + \frac12\left[(\vartheta_i-\mu_i)^T L^{(\vartheta_i\vartheta_i)}(\vartheta_i-\mu_i) + (\vartheta_j-\mu_j)^T L^{(\vartheta_j\vartheta_j)}(\vartheta_j-\mu_j)\right] \\ &= L(\mu) + (\vartheta_i-\mu_i)^T L^{(\vartheta_i)} + (\vartheta_j-\mu_j)^T L^{(\vartheta_j)} + \frac12\left\{\tr\!\left[(\vartheta_i-\mu_i)(\vartheta_i-\mu_i)^T L^{(\vartheta_i\vartheta_i)}\right] + \tr\!\left[(\vartheta_j-\mu_j)(\vartheta_j-\mu_j)^T L^{(\vartheta_j\vartheta_j)}\right]\right\} \end{aligned} $

where $L^{(\vartheta_i\vartheta_i)}$ denotes $\partial^2 L/\partial\vartheta_i\partial \vartheta_i$ evaluated at its mode $\mu$,

and, since the first-order terms vanish under the expectation ($\langle\vartheta_i-\mu_i\rangle = 0$), write

$ \begin{aligned} U &= L(\mu) + \frac12\left\{\tr\!\left[\left\langle(\vartheta_i-\mu_i)(\vartheta_i-\mu_i)^T\right\rangle L^{(\vartheta_i\vartheta_i)}\right] + \tr\!\left[\left\langle(\vartheta_j-\mu_j)(\vartheta_j-\mu_j)^T\right\rangle L^{(\vartheta_j\vartheta_j)}\right]\right\} \\ &= L(\mu) + \frac12\left\{\tr\!\left[\Sigma_i L^{(\vartheta_i\vartheta_i)}\right] + \tr\!\left[\Sigma_j L^{(\vartheta_j\vartheta_j)}\right]\right\} \end{aligned} $

Here, one uses the property: $a^TCa = \tr[aa^TC]$.
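
A two-line NumPy check of this identity (not part of the original notebook; any dimensions work):

import numpy as np

a = np.random.randn(3)
C = np.random.randn(3, 3)
print(np.allclose(a.dot(C).dot(a), np.trace(np.outer(a, a).dot(C))))  # True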

Thus, the free-energy becomes

$ F = L(\mu) + \frac12\left[\tr\!\left(\Sigma_i L^{(\vartheta_i\vartheta_i)}\right) + \tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j)}\right)\right] + \frac{D_i}{2}\ln 2\pi e + \frac12\ln|\Sigma_i| + \frac{D_j}{2}\ln 2\pi e + \frac12\ln|\Sigma_j| $

Solving for $\Sigma_i$ (or simply by inspection, completing the square):

$ \begin{aligned} \delta F/\delta\Sigma_i = 0 &= \frac12 L^{(\vartheta_i\vartheta_i)T} + \frac12\Sigma_i^{-1} \\ \Sigma_i^{-1} &= -L^{(\vartheta_i\vartheta_i)T} = -L^{(\vartheta_i\vartheta_i)} \end{aligned} $
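
As a minimal illustration of this rule (not from the original notebook): for an exactly quadratic log-joint $L(\vartheta) = -\frac12(\vartheta-m)^T P(\vartheta-m)$ the Hessian is $-P$ everywhere, so the negative inverse Hessian returns the exact posterior covariance $P^{-1}$.

import numpy as np

P = np.array([[2., 0.5], [0.5, 1.]])        # precision of a quadratic log-joint
L_hess = -P                                  # Hessian of L (constant for a quadratic)
Sigma = -np.linalg.inv(L_hess)               # Laplace covariance: negative inverse Hessian
print(np.allclose(Sigma, np.linalg.inv(P)))  # True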


Note that the same procedure can be applied to each individual variational density to give

$ \begin{aligned} \ln q(\vartheta_i) &\propto V(\vartheta_i) = \left\langle L(\vartheta_i,\mu_j) + \frac12(\vartheta_j-\mu_j)^T L^{(\vartheta_j\vartheta_j)}(\vartheta_i,\mu_j)\,(\vartheta_j-\mu_j)\right\rangle_{q_j} \\ V(\vartheta_i) &= L(\vartheta_i,\mu_j) + \frac12\tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j)}\right) \end{aligned} $

This is useful when optimising for the variational mode $\mu_i$.

§ Optimisation

To optimise the sufficient statistics of the variational density, $q$, under the Laplace approximation, we only need to find the mean and covariance of each factor, as illustrated in the preceding sections:

$ \begin{aligned} \mu_i &= \arg\max_{\vartheta_i} V(\vartheta_i) \\ \Sigma_i &= -\left[L(\mu)^{(\vartheta_i\vartheta_i)}\right]^{-1} \end{aligned} $

It is worth noting that the covariance is a function of its variational mode and does not depend on the mean-field approximation.

  • Optimise for $\mu_i$

The update scheme should read

$ \begin{aligned} V(\mu_i)^{(\vartheta_{i:k})} &= L(\mu)^{(\vartheta_{i:k})} + \frac12\tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j\vartheta_{i:k})}\right) \\ V(\mu_i)^{(\vartheta_{i:k}\vartheta_{i:l})} &= L(\mu)^{(\vartheta_{i:k}\vartheta_{i:l})} + \frac12\tr\!\left(\Sigma_j L^{(\vartheta_j\vartheta_j\vartheta_{i:k}\vartheta_{i:l})}\right) \\ \Delta\mu_i &= \left(\exp(tJ) - I\right)J^{-1}\,V(\mu_i)^{(\vartheta_i)} \to -\left[V(\mu_i)^{(\vartheta_i\vartheta_i)}\right]^{-1} V(\mu_i)^{(\vartheta_i)}, \quad \text{if } t \to \infty \end{aligned} $

where the superscript $(\vartheta_{i:k})$ denotes the partial derivative with respect to the $k$-th element of $\vartheta_i$, and $J = V(\mu_i)^{(\vartheta_i\vartheta_i)}$. For a detailed derivation see the linearisation note (A.2).
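
A minimal NumPy sketch of the $\Delta\mu_i$ step above (not from the original notebook; grad_V and hess_V are hypothetical stand-ins for $V(\mu_i)^{(\vartheta_i)}$ and $V(\mu_i)^{(\vartheta_i\vartheta_i)}$ at the current mode, and scipy.linalg.expm supplies the matrix exponential):

import numpy as np
from scipy.linalg import expm, pinv

def mu_step(grad_V, hess_V, t=np.inf):
    # one local-linearisation update of the variational mode;
    # t = inf recovers the Gauss-Newton limit used in the examples below
    J = hess_V
    if np.isinf(t):
        return -pinv(J).dot(grad_V)
    return (expm(t * J) - np.eye(J.shape[0])).dot(pinv(J)).dot(grad_V)

With $J$ negative definite, $\exp(tJ) \to 0$ as $t \to \infty$, which is how the limiting Gauss-Newton form arises.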

§ Implementational considerations

Suppose one has a parameter set that factorises into $\vartheta = \{\theta, \varphi\}$,

and let $\theta=(\theta_1, \theta_2)^T$ and $\varphi = (\varphi_1, \varphi_2)^T$

$ V^{(\varphi\varphi)} = \begin{pmatrix} V^{(\varphi_1\varphi_1)} & V^{(\varphi_1\varphi_2)} \\ V^{(\varphi_2\varphi_1)} & V^{(\varphi_2\varphi_2)} \end{pmatrix} = \begin{pmatrix} L^{(\varphi_1\varphi_1)} & L^{(\varphi_1\varphi_2)} \\ L^{(\varphi_2\varphi_1)} & L^{(\varphi_2\varphi_2)} \end{pmatrix} + \frac12\begin{pmatrix} \tr\!\left[\Sigma_\theta L^{(\theta\theta\varphi_1\varphi_1)}\right] & \tr\!\left[\Sigma_\theta L^{(\theta\theta\varphi_1\varphi_2)}\right] \\ \tr\!\left[\Sigma_\theta L^{(\theta\theta\varphi_2\varphi_1)}\right] & \tr\!\left[\Sigma_\theta L^{(\theta\theta\varphi_2\varphi_2)}\right] \end{pmatrix} $

Unpack $L^{(\theta\theta\varphi_1\varphi_1)}$ in the second term

$ L^{(\theta\theta\varphi_1\varphi_1)} = \begin{pmatrix} L^{(\theta_1\theta_1\varphi_1\varphi_1)} & L^{(\theta_1\theta_2\varphi_1\varphi_1)} \\ L^{(\theta_2\theta_1\varphi_1\varphi_1)} & L^{(\theta_2\theta_2\varphi_1\varphi_1)} \end{pmatrix} $

It is straightforward to show that, for instance, $\tr\left[\Sigma_\theta L^{(\theta\theta\varphi_1\varphi_2)}\right]$ is equivalent to $\Sigma^T\!\!\!:_\theta^T L\!\!:^{(\theta\theta)(\varphi_1\varphi_2)}$, where $\Sigma\!\!:$ denotes matrix vectorisation, and $\Sigma^T\!\!\!:$ stands for vectorisation along columns.

Thus, $L\!\!:^{(\theta\theta)(\varphi\varphi)}$ becomes a matrix that reads

$ \begin{pmatrix} L\!:^{(\theta\theta)(\varphi_1\varphi_1)} & L\!:^{(\theta\theta)(\varphi_1\varphi_2)} \\ L\!:^{(\theta\theta)(\varphi_2\varphi_1)} & L\!:^{(\theta\theta)(\varphi_2\varphi_2)} \end{pmatrix} = \begin{pmatrix} L^{(\theta_1\theta_1\varphi_1\varphi_1)} & L^{(\theta_1\theta_1\varphi_1\varphi_2)} \\ L^{(\theta_2\theta_1\varphi_1\varphi_1)} & L^{(\theta_2\theta_1\varphi_1\varphi_2)} \\ L^{(\theta_1\theta_2\varphi_1\varphi_1)} & L^{(\theta_1\theta_2\varphi_1\varphi_2)} \\ L^{(\theta_2\theta_2\varphi_1\varphi_1)} & L^{(\theta_2\theta_2\varphi_1\varphi_2)} \\ L^{(\theta_1\theta_1\varphi_2\varphi_1)} & L^{(\theta_1\theta_1\varphi_2\varphi_2)} \\ L^{(\theta_2\theta_1\varphi_2\varphi_1)} & L^{(\theta_2\theta_1\varphi_2\varphi_2)} \\ L^{(\theta_1\theta_2\varphi_2\varphi_1)} & L^{(\theta_1\theta_2\varphi_2\varphi_2)} \\ L^{(\theta_2\theta_2\varphi_2\varphi_1)} & L^{(\theta_2\theta_2\varphi_2\varphi_2)} \end{pmatrix} $

and one accordingly has

$ V^{(\varphi\varphi)} = L^{(\varphi\varphi)} + \frac12\left(I \otimes \Sigma^T\!\!:_\theta\right)^T L\!:^{(\theta\theta)(\varphi\varphi)} $

The same procedure can be repeated for $V^{(\theta\theta)}, V^{(\varphi)}$ and $V^{(\theta)}$ to give

$ \begin{aligned} V^{(\theta\theta)} &= L^{(\theta\theta)} + \frac12\left(I \otimes \Sigma^T\!\!:_\varphi\right)^T L\!:^{(\varphi\varphi)(\theta\theta)} \\ V^{(\varphi)} &= L^{(\varphi)} + \frac12\left(I \otimes \Sigma^T\!\!:_\theta\right)^T L\!:^{(\theta\theta)(\varphi)} \\ V^{(\theta)} &= L^{(\theta)} + \frac12\left(I \otimes \Sigma^T\!\!:_\varphi\right)^T L\!:^{(\varphi\varphi)(\theta)} \end{aligned} $
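
The vectorisation identity is easy to check numerically. The sketch below (not in the original notebook; the array L4 is a random stand-in for the four-index derivative $L^{(\theta\theta\varphi\varphi)}$, and since $\Sigma_\theta$ is symmetric, $\Sigma\!:$ and $\Sigma^T\!\!:$ coincide) confirms that the Kronecker expression reproduces the element-wise traces:

import numpy as np

np.random.seed(0)
d = 2                                              # dimension of theta and of phi
Sigma_th = np.cov(np.random.randn(d, 10))          # a symmetric covariance matrix
L4 = np.random.randn(d, d, d, d)
L4 = 0.5 * (L4 + L4.transpose(1, 0, 2, 3))         # symmetric in the two theta indices

# element-wise traces tr[ Sigma_theta L^(theta theta phi_k phi_l) ]
traces = np.array([[np.trace(Sigma_th.dot(L4[:, :, k, l])) for l in range(d)]
                   for k in range(d)])

# stacked matrix L: (each column vectorises the theta-theta block, stacked over phi_k)
M = np.vstack([np.column_stack([L4[:, :, k, l].flatten(order='F') for l in range(d)])
               for k in range(d)])

s = Sigma_th.flatten(order='F')[:, None]           # vectorised covariance, as a column
print(np.allclose(np.kron(np.eye(d), s).T.dot(M), traces))  # True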

§ Example 1

In this example we will attempt to optimise the following objective function using the updating scheme above.

$ L(u, v) = -\frac{1}{16} (u - \frac{1}{8}v^2)^2 - \frac{1}{128} (v^2 - 2)^2 $


In [1]:
import numpy as np
import scipy.stats as stat
import matplotlib.pyplot as plt
from IPython import display
import time
%matplotlib inline

In [2]:
import theano
import theano.tensor as T

print("Theano version: ", theano.__version__)


Using gpu device 0: Tesla K40c (CNMeM is enabled with initial size: 90.0% of memory, cuDNN Version is too old. Update to v5, was 2000.)
Theano version:  0.9.0dev1.dev-ec0419a6cf1d3667ece1d5e0f95498c229471f19

In [3]:
c0 = [1. / 16, 1. / 8, 1. / 128]
u, v = T.vectors(2)
Su, Sv = T.matrices(2)  # this is not necessary, only for the purpose of tutorial

L = (-c0[0] * (u - c0[1] * v ** 2) ** 2 - c0[2] * (v ** 2 - 2) ** 2)[0]

In [4]:
fSu, fSv = Su.flatten(1), Sv.flatten(1)  # vectorised covariance matrices

# first- and second-order derivatives of L with respect to each parameter
Lu = T.jacobian(L, u)
Lv = T.jacobian(L, v)

Luu = T.hessian(L, u).flatten(1)
Lvv = T.hessian(L, v).flatten(1)

# third- and fourth-order mixed derivatives, e.g. Luuv ~ L^(uuv)
Luuv = T.join(0, [T.jacobian(Luu[i], v) for i in range(1)])
Lvvu = T.join(0, [T.jacobian(Lvv[i], u) for i in range(1)])

Luuvv = T.join(0, [T.hessian(Luu[i], v).flatten(1) for i in range(1)])
Lvvuu = T.join(0, [T.hessian(Lvv[i], u).flatten(1) for i in range(1)])

# mean-field correction terms, e.g. Uu ~ tr(Sigma_v L^(vvu))
Uu = T.join(0, [T.sum(fSv * Lvvu[:, i]) for i in range(1)])
Uv = T.join(0, [T.sum(fSu * Luuv[:, i]) for i in range(1)])

Uuu = T.join(0, [T.sum(fSv * Lvvuu[:, i]) for i in range(1)])
Uvv = T.join(0, [T.sum(fSu * Luuvv[:, i]) for i in range(1)])

# gradients and Hessians of the variational energies
Vu = Lu + 0.5 * Uu
Vv = Lv + 0.5 * Uv

Vuu = (Luu + 0.5 * Uuu).reshape((1, 1))
Vvv = (Lvv + 0.5 * Uvv).reshape((1, 1))

# free energy (up to additive constants)
F = L + 0.5 * (Su.dot(Luu) + Sv.dot(Lvv)) + 0.5 * T.log(T.nlinalg.det(Su)) + 0.5 * T.log(T.nlinalg.det(Sv))

# Gauss-Newton updates for the variational modes (the t -> infinity limit)
du = -(T.nlinalg.pinv(Vuu).dot(Vu))
dv = -(T.nlinalg.pinv(Vvv).dot(Vv))

# alternatively, the finite-t local-linearisation update
# du = (T.exp(Vuu) - 1).dot(T.nlinalg.pinv(Vuu).dot(Vu))
# dv = (T.exp(Vvv) - 1).dot(T.nlinalg.pinv(Vvv).dot(Vv))

# conditional covariances: negative inverse curvature
Cu = - T.nlinalg.pinv(Luu.reshape((1, 1)))
Cv = - T.nlinalg.pinv(Lvv.reshape((1, 1)))

In [5]:
update = theano.function([u, v, Su, Sv], [du, dv, Cu, Cv], allow_input_downcast=True)
elb = theano.function([u, v, Su, Sv], F, allow_input_downcast=True)

In [6]:
tu, tv = 1., 1.  # delta_u, delta_v

u0, v0, Su0, Sv0 = np.array([1.]), np.array([1.]), np.array([[1.]]), np.array([[1.]])
tCu, tCv = Su0, Sv0
tol = 0.001

In [7]:
plt.figure()
plt.hold(True)

step = 0
for maxM in range(256):
    plt.subplot(211)
    plt.plot(v0, u0, 'r*', alpha=1)
    for maxN in range(256):
        tu, tv, tCu, tCv = update(u0, v0, Su0, Sv0)
        u0 += tu
        v0 += tv
        
        plt.subplot(211)
        plt.plot(v0, u0, 'b.', alpha=.5);
        plt.subplot(212)
        plt.plot(step, elb(u0, v0, Su0, Sv0), 'r+');
        display.display(plt.gcf())
        display.clear_output(wait=True)
        time.sleep(0.)
        step += 1
        if (tu ** 2) < tol and (tv ** 2) < tol: break
    if np.sum((Su0 - tCu) ** 2) < tol and np.sum((Sv0 - tCv) ** 2) < tol: break
    Su0, Sv0 = tCu, tCv
print("Converged.")
print("u: ", u0, "v: ", v0, "Cu: ", Su0, "Cv: ", Sv0)


Converged.
u:  [ 0.99525321] v:  [ 1.63549357] Cu:  [[ 8.]] Cv:  [[ 5.28174448]]

§ Example 2

Instead of writing multiple lines of code, in this example we create a function vilaps() that lets us reuse the updating scheme for different models/objective functions.

This function uses Theano to generate a compiled updating scheme, with optional observation y and hyperprior hp arguments.


In [8]:
import scipy.stats as stat
import numpy as np
import theano
import functools
import theano.tensor as T

print("Theano version: ", theano.__version__)


Theano version:  0.9.0dev1.dev-ec0419a6cf1d3667ece1d5e0f95498c229471f19

In [9]:
def vilaps(L, b, bdims, y=None, hp=None):
    """Compile a variational-Laplace update scheme for a given log-joint.

    L      : Theano scalar expression for the log joint
    b      : list of Theano vectors, one per mean-field factor
    bdims  : dimensionality of each factor
    y, hp  : optional observation and hyperparameter inputs
    """

    def _jac_(*args, **kwargs):
        return T.jacobian(*args, disconnected_inputs='warn', **kwargs)
    
    def _hes_(*args, **kwargs):
        return T.hessian(*args, disconnected_inputs='warn', **kwargs)
    
    assert len(b) == len(bdims)
    
    if y is None:
        y = []
    else:
        y = [y]
       
    if hp is None:
        hp = []
    
    mfp = len(bdims)
    
    Li  = [_jac_(L, b[i]) for i in range(mfp)]
    Lii = [_hes_(L, b[i]).flatten(1) for i in range(mfp)]
       
    Liij  = [[T.join(0, [_jac_(Lii[i][k], b[j]) for k in range(bdims[i] ** 2)])
              for j in range(mfp)]
             for i in range(mfp)]
    Liijj = [[T.join(0, [_hes_(Lii[i][k], b[j]).flatten(1) for k in range(bdims[i] ** 2)])
              for j in range(mfp)]
             for i in range(mfp)]
    
    Ci  = [- T.nlinalg.pinv(Lii[i].reshape((bdims[i], bdims[i]))) for i in range(mfp)]
    fCi = [Ci[i].flatten(1) for i in range(mfp)]
    
    Uj = [[T.join(0, [T.sum(fCi[j] * Liij[j][i][:, k])  for k in range(bdims[i])])
           for j in range(mfp)
           if j != i]
          for i in range(mfp)]
    
    Ujj = [[T.join(0, [T.sum(fCi[j] * Liijj[j][i][:, k]) for k in range(bdims[i] ** 2)])
            for j in range(mfp)
            if j != i]
           for i in range(mfp)]
    
    Vi  = [Li[i] + 0.5 * functools.reduce(lambda x, y: x + y, Uj[i]) for i in range(mfp)]
    Vii = [(Lii[i] + 0.5 * functools.reduce(lambda x, y: x + y, Ujj[i])).reshape((bdims[i], bdims[i]))
           for i in range(mfp)]
    
    bGN = [-T.nlinalg.pinv(Vii[i]).dot(Vi[i]) for i in range(mfp)]
    bLM = [(T.exp(Vii[i]) - 1).dot(T.nlinalg.pinv(Vii[i])).dot(Vi[i]) for i in range(mfp)]

    F = L
    for i in range(mfp):
        # accumulate F = L + 0.5 * sum_i [ tr(C_i L^(ii)) + ln|C_i| ]
        F += 0.5 * (T.sum(fCi[i] * Lii[i]) + T.log(T.nlinalg.det(Ci[i])))
        
    viupdate = theano.function(
        b + hp + y,
        bGN + bLM + Ci + [F],
        allow_input_downcast=True)
    
    return viupdate

This time we try to invert the following model that generates our observation $y$

$ \begin{aligned} p(y|\mu,\tau) &= N(y|\mu,\tau^{-1}) \\ p(\mu|\tau) &= N(\mu|\mu_0,(\lambda_0\tau)^{-1}) \\ p(\tau) &= \mathrm{Ga}(\tau|a_0,b_0) \end{aligned} $

where we assume the mean-field approximation

$ q(\mu,\tau) = q(\mu)q(\tau) $

and the Laplace assumption.
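
For orientation (not spelled out in the original), the log joint of the stated model, up to additive constants and taking $b_0$ as a scale parameter (as in the SciPy sampler below), is

$ L(\mu,\tau) = \frac{N+1}{2}\ln\tau - \frac\tau2\sum_{n=1}^N(y_n-\mu)^2 - \frac{\lambda_0\tau}{2}(\mu-\mu_0)^2 + (a_0-1)\ln\tau - \frac{\tau}{b_0} + \mathrm{const} $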


In [10]:
hv = [9., 3.]
hu = [5., 2.]
rngv = stat.gamma(a=hv[0], scale=hv[1])
N = 1000

Y = np.zeros((N, 1))
U = np.zeros(N)
V = np.zeros(N)

for s in range(N):
    v = rngv.rvs()
    rngu = stat.norm(loc=hu[0], scale=np.sqrt(1 / (hu[1] * v)))
    u = rngu.rvs()
    Y[s, :] = stat.norm(loc=u, scale=1 / v).rvs()
    U[s] = u
    V[s] = v

In [11]:
y = T.matrix()
u, v, pu, pv = T.vectors(4)

L1 = T.sum(-0.5 * (y - u).dot(T.nlinalg.matrix_inverse(T.diag(v))) * (y - u)) -\
    0.5 * ((u - pu[0]).dot(T.nlinalg.matrix_inverse(T.diag(pu[1] * v)))).dot(u - pu[0]) -\
    T.gammaln(pv[0]) * pv[1] ** pv[0] + (pv[0] - 1) * T.log(v) - v / pv[1]
L1 = T.sum(L1)

In [12]:
b = [u, v]
hp = [pu, pv]
bdims = [1, 1]

# The line below generates a warning (or raises a DisconnectedInputError if the disconnected_inputs flag
# is not otherwise set for T.jacobian and T.hessian). This happens when AD is asked for a derivative with
# respect to a variable that the expression does not depend on.
viu = vilaps(L1, b, bdims, y, hp);


/home/yen/env/p34/lib/python3.4/site-packages/theano/gradient.py:533: UserWarning: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: <TensorType(float32, vector)>
  handle_disconnected(elem)
/home/yen/env/p34/lib/python3.4/site-packages/theano/gradient.py:559: UserWarning: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: <DisconnectedType>
  handle_disconnected(rval[i])

In [13]:
u0, v0 = np.array([2.]), np.array([5.])
tol = 0.0001

plt.figure()
for step in range(256):
    gnu, gnv, lmu, lmv, Cu, Cv, FE = viu(u0, v0, hu, hv, Y)
    tu, tv = gnu, gnv
    u0 += tu
    v0 += tv
    plt.subplot(211)
    plt.plot(u0, v0, 'bo', alpha=.5)
    plt.subplot(212)
    plt.plot(step, FE, 'r+')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    if np.square(tu) < tol and np.square(tv) < tol: break



In [14]:
plt.figure(figsize=(16, 5))
plt.subplot(121)
plt.plot(U, V, 'b.', alpha=.1);
plt.plot(hu[0], np.prod(hv), 'r*', markersize=12, alpha=.7);
plt.plot(u0, v0, 'y*', markersize=12, alpha=.7);
plt.grid(True)
plt.subplot(122)
plt.plot(U, V, 'b.', alpha=.1);
plt.plot(hu[0], np.prod(hv), 'r*', markersize=12, alpha=.7);
plt.plot(u0, v0, 'y*', markersize=12, alpha=.7);
plt.grid(True)
plt.xlim([4.8, 5.2])
plt.ylim([20, 40])
plt.legend(["sample", "true", "approx"]);


Out[14]:
<matplotlib.legend.Legend at 0x7fd97931f358>

§ Example 3

This example repeats Example 1 using the function we just wrote.


In [15]:
c0 = [1. / 16, 1. / 8, 1. / 128]
u, v = T.vectors(2)
L2 = (-c0[0] * (u - c0[1] * v ** 2) ** 2 - c0[2] * (v ** 2 - 2) ** 2)[0]

viu = vilaps(L2, [u, v], [1, 1], hp=None, y=None)

In [21]:
plt.figure()
u0 = np.array([1.])  # initial modes as arrays, so that += updates them in place
v0 = np.array([1.])
tol = 0.0001

for step in range(256):
    gnu, gnv, lmu, lmv, Cu, Cv, FE = viu(u0, v0)
    tu, tv = gnu, gnv
    u0 += tu
    v0 += tv
    plt.subplot(211)
    plt.plot(u0, v0, 'bo', alpha=.5)
    plt.subplot(212)
    plt.plot(step, FE, 'r+')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    if np.all(np.square([tu, tv]) < tol): break
print(u0, v0)


[ 0.99625185] [ 1.63277395]

A.1 Multivariate normal distribution

Say $\pmb x \sim N(\pmb\mu, \pmb\Lambda^{-1})$

Then,

$ \ln p(\pmb x) = -\frac D 2\ln 2\pi - \frac 1 2 \ln|\pmb\Lambda^{-1}| - \frac 1 2 (\pmb x-\pmb\mu)^T\pmb\Lambda(\pmb x-\pmb\mu) $

$ H(\pmb x) = \langle -\ln p(\pmb x)\rangle = \frac D 2\ln 2\pi e + \frac 1 2 \ln |\pmb\Lambda^{-1}| $

$ D = \dim(\pmb x) $
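
The entropy line follows in one step (a standard identity, spelled out here for completeness), since $\langle(\pmb x-\pmb\mu)^T\pmb\Lambda(\pmb x-\pmb\mu)\rangle = \tr\!\left[\pmb\Lambda\langle(\pmb x-\pmb\mu)(\pmb x-\pmb\mu)^T\rangle\right] = \tr[\pmb I] = D$:

$ H(\pmb x) = \frac D 2\ln 2\pi + \frac 1 2 \ln|\pmb\Lambda^{-1}| + \frac 1 2 \langle(\pmb x-\pmb\mu)^T\pmb\Lambda(\pmb x-\pmb\mu)\rangle = \frac D 2\ln 2\pi e + \frac 1 2 \ln|\pmb\Lambda^{-1}| $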

A.2 Linearisation

The following derives a local linearisation of a dynamical system, proposed by Ozaki (1992).

Suppose we have a stochastic dynamical system driven by continuous-time Gaussian white noise $n(t)$.

$ \dot y(t) = f(y(t)) + n(t) \tag{1} $

First, we approximate the equation by a Gaussian process,

$ \dot y(t) = K_t\, y(t) + n(t) \tag{2} $

with some appropriate $K_t$ on a small interval $[t, t+\Delta t)$.

From the assumption that the Jacobian of the linear part of $(2)$ over each interval $[t, t+\Delta t)$ is given by the Jacobian $J_t = \frac{\partial f(y)}{\partial y}$ at $y = y_t$ of the original dynamical system, we have the following equation by the chain rule,

$ \ddot y(s) = \frac{d\dot y(s)}{ds} = \frac{\partial f(y)}{\partial y}\frac{\partial y}{\partial s} = J_t\, \dot y(s) \tag{3} $

for $t \leq s \leq t + \Delta t$.

By integrating (3) on $[t, t + \tau)(0 \leq \tau < \Delta t)$ we have

$ \dot y(t+\tau) = \exp(J_t\tau)\, \dot y(t) = \exp(J_t\tau)\, f(y(t)) \tag{4} $

Integrating equation $(4)$ again with respect to $\tau$ on $[0, \Delta t)$, using integration by parts, we finally obtain the following expression for $y(t+\Delta t)$ as a function of $y(t)$:

$ y(t+\Delta t) = y(t) + J_t^{-1}\left\{\exp(J_t\Delta t) - I\right\} f(y(t)) \tag{5} $
  • Ozaki T (1992) "A bridge between nonlinear time series models and nonlinear stochastic dynamical systems: a local linearization approach", Statistica Sinica 2(1), 113-135.
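
A minimal numerical illustration of $(5)$ (not part of the original note): one local-linearisation step for the scalar toy system $\dot y = -y^3 + 1$, compared with a plain Euler step of the same size; f and jac below are the drift and its Jacobian for this example.

import numpy as np
from scipy.linalg import expm

def f(y):
    return -y ** 3 + 1.0                   # drift of the toy system

def jac(y):
    return np.array([[-3.0 * y[0] ** 2]])  # Jacobian df/dy

def ll_step(y, dt):
    # one Ozaki local-linearisation step, equation (5)
    J = jac(y)
    return y + np.linalg.solve(J, (expm(J * dt) - np.eye(len(y))).dot(f(y)))

y0 = np.array([2.0])
print(ll_step(y0, 0.5))   # ~[1.42], close to the fixed point at y = 1
print(y0 + 0.5 * f(y0))   # [-1.5], the Euler step of the same size overshoots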